Advanced data techniques

Gerko Vink

Methodology & Statistics @ Utrecht University

10 Jun 2025

Disclaimer

I owe a debt of gratitude to many people, as the thoughts and code in these slides are the product of years-long development cycles and discussions with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.

These materials are generated by Gerko Vink, who holds the copyright. The intellectual property belongs to Utrecht University. Images are either directly linked, or generated with StableDiffusion or DALL-E. That said, there is no information in this presentation that exceeds legal use of copyright materials in academic settings, or that should not be part of the public domain.

Warning

You may use any and all content in this presentation - including my name - and submit it as input to generative AI tools, with the following exception:

  • You must ensure that the content is not used for further training of the model

Slide materials and source code

Materials

Anatomy of an Answer

Terms I may use

  • TDGM: True data generating model
  • DGP: Data generating process, closely related to the TDGM, but with all the wacky additional uncertainty
  • Truth: The comparative truth that we are interested in
  • Bias: The distance to the comparative truth
  • Variance: When not everything is the same
  • Estimate: Something that we calculate or guess
  • Estimand: The thing we aim to estimate and guess
  • Population: That larger entity without sampling variance
  • Sample: The smaller thing with sampling variance
  • Incomplete: There exists a more complete version, but we don’t have it
  • Observed: What we have
  • Unobserved: What we would also like to have

At the start

Let’s start with the core:

Statistical inference

Statistical inference is the process of drawing conclusions from truths

Truths are boring, but they are convenient.

  • however, for most problems truths require a lot of calculations, tallying or a complete census.
  • therefore, a proxy of the truth is in most cases sufficient
  • An example for such a proxy is a sample
  • Samples are widely used and have been for a long time. See Jelke Bethlehem’s CBS discussion paper for an overview of the history of sampling within survey statistics.

Do we need data?

Without any data we can still come up with a statistically valid answer.

  • The answer will not be very informative.
  • In order for our answer to be more informative, we need more information

Some sources of information can already tremendously guide the precision of our answer.
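
A minimal sketch of how information sharpens an answer, assuming a hypothetical problem in which we estimate an unknown proportion. Without data, the interval [0, 1] is valid but uninformative. The sample sizes, and the choice that exactly half of the cases are "successes", are assumptions for illustration only.

```r
# Hypothetical example: estimating an unknown proportion.
# With no data, the interval [0, 1] is valid but says nothing.
no_data <- c(lower = 0, upper = 1)

# Exact (Clopper-Pearson) 95% intervals when half of the n cases are "successes"
n <- c(10, 100, 1000)
intervals <- t(sapply(n, function(k) binom.test(k / 2, k)$conf.int))
colnames(intervals) <- c("lower", "upper")

# Stack the no-data answer on top of the data-based answers
result <- rbind(no_data, intervals)
rownames(result) <- c("n = 0", paste("n =", n))

# Interval width shrinks as more information comes in
width <- result[, "upper"] - result[, "lower"]
print(cbind(result, width = width))
```

Every interval in this table is a valid answer; they differ only in how informative they are.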

In Short

Information bridges the answer to the truth. Too little information may lead you to a false truth.

Being wrong about the truth

  • The population is the truth
  • The sample comes from the population, but is generally smaller in size
  • This means that not all cases from the population can be in our sample
  • If not all information from the population is in the sample, then our sample may be wrong

Good questions to ask yourself

  1. Why is it important that our sample is not wrong?
  2. How do we know that our sample is not wrong?

Solving the missingness problem

  • There are many flavours of sampling
  • If we give every unit in the population the same probability to be sampled, we do random sampling
  • The convenience with random sampling is that the missingness problem can be ignored
  • The missingness problem would in this case be: not every unit in the population has been observed in the sample
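
The difference between random and non-random sampling can be sketched with a small simulation. The population below is hypothetical, and the selective mechanism (larger values are more likely to be sampled) is just one assumed way in which sampling can depend on the values themselves.

```r
set.seed(2025)

# Hypothetical population of 10,000 units; in practice it is unobserved
population <- rnorm(10000, mean = 50, sd = 10)

# Random sampling: every unit has the same probability of being selected
random_sample <- sample(population, size = 500)

# Contrast: the selection probability increases with the value itself
p_select <- rank(population) / sum(rank(population))
selective_sample <- sample(population, size = 500, prob = p_select)

# Compare the sample means with the population mean
means <- c(
  population = mean(population),
  random     = mean(random_sample),
  selective  = mean(selective_sample)
)
print(round(means, 2))
```

Under random sampling the unobserved units can be ignored, because they do not differ systematically from the observed ones. Under the selective mechanism they do, and the sample mean drifts away from the population mean.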

Hmmm…

Would that mean that if we simply observe every potential unit, we would be unbiased about the truth?

Sidestep

  • The problem is a bit larger

  • We have three entities at play, here:

    1. The truth we’re interested in
    2. The proxy that we have (e.g. sample)
    3. The model that we’re running
  • The more features we use, the more we capture about the outcome for the cases in the data

  • The more cases we have, the more we approach the true information


    All these things are related to uncertainty. Our model can still yield biased results when fitted to \(\infty\) features. Our inference can still be wrong when obtained on \(\infty\) cases.
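
A minimal sketch of why more cases do not rescue a wrong model. The data generating model below is hypothetical: the outcome y depends on x and on z, and x is related to z. A model that leaves out z stays biased, no matter how many cases we collect.

```r
set.seed(123)

# Hypothetical data generating model: y depends on x and z, x depends on z
simulate_slope <- function(n) {
  z <- rnorm(n)
  x <- 0.7 * z + rnorm(n)
  y <- 1 * x + 2 * z + rnorm(n)   # the true effect of x on y is 1
  c(n = n,
    omit_z    = unname(coef(lm(y ~ x))["x"]),       # model ignores z
    include_z = unname(coef(lm(y ~ x + z))["x"]))   # model matches the truth
}

# Estimated slope of x for increasing numbers of cases
slopes <- t(sapply(c(100, 10000, 100000), simulate_slope))
print(slopes)
```

With these assumed coefficients, the slope from the model without z should settle around 1 + 2 × 0.7 / 1.49 ≈ 1.94 rather than 1. More cases only make us more certain about the wrong value.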

Core assumption: all observations are bona fide

Uncertainty simplified

When we do not have all information …

  1. We need to accept that we are probably wrong
  2. We just have to quantify how wrong we are


In some cases we estimate that we are only a bit wrong. In other cases we estimate that we could be very wrong. This is the purpose of testing.

The uncertainty measures about our estimates can be used to create intervals

Confidence in the answer

An intuitive approach to evaluating an answer is confidence. In statistics, we often use confidence intervals. Discussing confidence can be hugely informative!

If we draw 100 samples from a population and compute a 95% CI on each, we expect about 95 of the 100 intervals to cover the true population value.

  • If the coverage < 95%: bad estimation process with risk of errors and invalid inference
  • If the coverage > 95%: inefficient estimation process, but correct conclusions and valid inference. Lower statistical power.

How do we know that our sample is not….

We can replicate our sample.

  • A replication would be a new sample from the same population or true data generating model obtained by the same data generating process.
  • If we would sample 100 times, we would get 100 different samples
  • If we would estimate 100 times, we would get 100 different estimates with 100 different confidence intervals (e.g. 95% CI)
  • Out of these 100 different intervals, we would expect a nominal coverage. For a 95% CI we’d expect 95 of them to cover the true population value.
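
The replication steps above can be sketched as follows. The true mean, standard deviation and sample size are assumptions for illustration; in a simulation we get to choose the truth, which is exactly what makes coverage verifiable.

```r
set.seed(1986)

# Hypothetical true data generating model: normal with a known mean
true_mean <- 10
n_rep     <- 100   # number of replicated samples
n_obs     <- 50    # cases per sample

# For each replication: draw a sample and compute the 95% CI of the mean
intervals <- t(replicate(n_rep, {
  y <- rnorm(n_obs, mean = true_mean, sd = 3)
  t.test(y)$conf.int
}))

# Does each interval cover the true value?
covered <- intervals[, 1] < true_mean & true_mean < intervals[, 2]
sum(covered)   # expected to be close to 95, but it varies with the seed
```

The count will rarely be exactly 95. Increasing n_rep brings the observed coverage closer to the nominal level.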

This is a lot of work…

Full sampling validation of a model’s inferences is a lot of work.

  • it is the most robust way of obtaining inferential validity
  • it is not always necessary

Under some general assumptions, we can use the same data to validate our model’s inferences and predictions.

  • these assumptions can be met in practice
  • but as soon as assumptions are made, we open the door to errors when these assumptions do not hold

Assumptions

Take the following definition:

a thing that is accepted as true or as certain to happen, without proof.

Assumptions are a statistician’s faith. It is often impossible to prove that they hold in practice, but we choose to believe that they do.

Sensitivity analyses

I often use computational evaluation techniques to quantify the impact of the assumptions made. For example, we can test the effect of violating assumptions on our results. We then verify whether the inferences are sensitive to violations of the assumptions. We can even verify the extent to which assumptions must be violated before they start to influence our inferences.
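
A minimal sketch of such a sensitivity analysis, assuming we want to know how much the 95% t-interval for a mean depends on normally distributed data. The helper `coverage()`, the sample size of 10 and the log-normal alternative are hypothetical choices for illustration, not a fixed recipe.

```r
set.seed(42)

# Coverage of the 95% t-interval for the mean, given a data generator
coverage <- function(rgen, true_mean, n_obs = 10, n_sim = 2000) {
  hits <- replicate(n_sim, {
    ci <- t.test(rgen(n_obs))$conf.int
    ci[1] < true_mean && true_mean < ci[2]
  })
  mean(hits)
}

# Assumption holds: normally distributed data
cov_normal <- coverage(function(n) rnorm(n, mean = 1), true_mean = 1)

# Assumption violated: strongly skewed (log-normal) data
true_lnorm <- exp(1.5^2 / 2)   # mean of a log-normal with meanlog 0, sdlog 1.5
cov_skewed <- coverage(function(n) rlnorm(n, sdlog = 1.5),
                       true_mean = true_lnorm)

print(c(normal = cov_normal, skewed = cov_skewed))
```

Varying `sdlog` or `n_obs` shows from which point onward the violation starts to matter for the inference.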

The holy trinity

Whenever I evaluate something, I tend to look at three things:

  • bias (how far from the truth)
  • uncertainty/variance (how wide is my interval)
  • coverage (how often do I cover the truth with my interval)


As a function of model complexity in specific modeling efforts, these components play a role in the bias/variance tradeoff

On the individual level

Individual intervals can also be hugely informative!

Individual intervals are generally wider than confidence intervals

  • This is because they cover the inherent uncertainty in an individual data point on top of the sampling uncertainty
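
The difference in width can be sketched with a linear model on the built-in `cars` data. The data set and the new case (speed 15) are chosen for illustration only; `predict()` for `lm` objects returns both types of interval.

```r
# Built-in cars data: stopping distance (dist) as a function of speed
fit <- lm(dist ~ speed, data = cars)
new_case <- data.frame(speed = 15)

# Interval for the mean stopping distance at speed 15
ci_mean <- predict(fit, newdata = new_case, interval = "confidence")

# Interval for the stopping distance of one new car at speed 15
pi_new <- predict(fit, newdata = new_case, interval = "prediction")

# Same point estimate (fit), different lower and upper bounds
both <- rbind(confidence = ci_mean[1, ], prediction = pi_new[1, ])
print(both)
```

Both intervals are centred on the same fitted value. The individual (prediction) interval is wider because it adds the residual variance to the uncertainty about the regression line.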

Be careful

Narrower intervals mean less uncertainty.

It does not mean that the answer is correct!

Case: Space Shuttle Challenger

On 28 January 1986, 73 seconds into its flight and at an altitude of 9 miles, the space shuttle Challenger experienced an enormous fireball caused by one of its two booster rockets and broke up. The crew compartment continued its trajectory, reaching an altitude of 12 miles, before falling into the Atlantic. All seven crew members, consisting of five astronauts and two payload specialists, were killed.

Nothing happened, so we ignored it

In the decision to proceed with the launch, there was a presence of dark data. And no-one noticed!

Dark data
Information that is not available but necessary to arrive at the correct answer.

This missing information has the potential to mislead people. The notion that we can be misled is essential because it also implies that artificial intelligence can be misled!

If you don’t have all the information, there is always the possibility of drawing an incorrect conclusion or making a wrong decision.

In Practice

We now have a new problem:

  • we do not have the whole truth, but merely a sample of the truth
  • we do not even have the whole sample, but merely a sample of the sample of the truth.

What would be a simple solution to allowing for valid inferences on the incomplete sample? Would that solution work in practice?

How to fix the missingness problem

There are two sources of uncertainty that we need to cover when analyzing incomplete data:

  1. Uncertainty about the data values we don’t have:
    when we don’t know what the true observed value should be, we must create a distribution of values with proper variance (uncertainty).
  2. Uncertainty about the process that generated the values we do have:
    nothing can guarantee that our sample is the one true sample. So it is reasonable to assume that the parameters obtained on our sample are biased.

A straightforward and intuitive solution for analyzing incomplete data in such scenarios is multiple imputation (Rubin, 1987).

Multiple imputation with mice

Inspect the missingness

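library(mice)       # multiple imputation; also ships the boys data set
library(ggmice)     # ggplot2-based plots for mice objects and incomplete data
plot_pattern(boys)  # visualise the missing data pattern in boys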

Impute boys

imp <- mice(boys)

 iter imp variable
  1   1  hgt  wgt  bmi  hc  gen  phb  tv  reg
  1   2  hgt  wgt  bmi  hc  gen  phb  tv  reg
  1   3  hgt  wgt  bmi  hc  gen  phb  tv  reg
  1   4  hgt  wgt  bmi  hc  gen  phb  tv  reg
  1   5  hgt  wgt  bmi  hc  gen  phb  tv  reg
  2   1  hgt  wgt  bmi  hc  gen  phb  tv  reg
  2   2  hgt  wgt  bmi  hc  gen  phb  tv  reg
  2   3  hgt  wgt  bmi  hc  gen  phb  tv  reg
  2   4  hgt  wgt  bmi  hc  gen  phb  tv  reg
  2   5  hgt  wgt  bmi  hc  gen  phb  tv  reg
  3   1  hgt  wgt  bmi  hc  gen  phb  tv  reg
  3   2  hgt  wgt  bmi  hc  gen  phb  tv  reg
  3   3  hgt  wgt  bmi  hc  gen  phb  tv  reg
  3   4  hgt  wgt  bmi  hc  gen  phb  tv  reg
  3   5  hgt  wgt  bmi  hc  gen  phb  tv  reg
  4   1  hgt  wgt  bmi  hc  gen  phb  tv  reg
  4   2  hgt  wgt  bmi  hc  gen  phb  tv  reg
  4   3  hgt  wgt  bmi  hc  gen  phb  tv  reg
  4   4  hgt  wgt  bmi  hc  gen  phb  tv  reg
  4   5  hgt  wgt  bmi  hc  gen  phb  tv  reg
  5   1  hgt  wgt  bmi  hc  gen  phb  tv  reg
  5   2  hgt  wgt  bmi  hc  gen  phb  tv  reg
  5   3  hgt  wgt  bmi  hc  gen  phb  tv  reg
  5   4  hgt  wgt  bmi  hc  gen  phb  tv  reg
  5   5  hgt  wgt  bmi  hc  gen  phb  tv  reg
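
A minimal sketch of how the imputations could be analysed next. The seed, the `printFlag = FALSE` setting and the analysis model `hgt ~ age` are assumptions for illustration; any model of interest can take its place.

```r
library(mice)

# Impute again with a fixed seed (an assumption, for reproducibility)
# and without printing the iteration history
imp <- mice(boys, m = 5, seed = 123, printFlag = FALSE)

# Fit the same illustrative model on each of the 5 completed data sets
fit <- with(imp, lm(hgt ~ age))

# Combine the 5 sets of estimates with Rubin's rules
pooled <- pool(fit)
summary(pooled, conf.int = TRUE)
```

The pooled standard errors combine the variance within each completed data set with the variance between the imputed data sets, so the uncertainty about the missing values carries through to the inference.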